Abstract

We propose Coactive Learning as a model of 
interaction between a learning system and a 
human user, where both have the common 
goal of providing results of maximum util- 
ity to the user. At each step, the system 
(e.g. search engine) receives a context (e.g. 
query) and predicts an object (e.g. ranking). 
The user responds by correcting the system 
if necessary, providing a slightly improved - 
but not necessarily optimal - object as feed- 
back. We argue that such feedback can often 
be inferred from observable user behavior, for 
example, from clicks in web-search. Evalu- 
ating predictions by their cardinal utility to 
the user, we propose efficient learning algorithms
that have $O(1/\sqrt{T})$ average regret, even
though the learning algorithm never observes 
cardinal utility values as in conventional on- 
line learning. We demonstrate the applica- 
bility of our model and learning algorithms 
on a movie recommendation task, as well as 
ranking for web-search. 



1. Introduction 

In a wide range of systems in use today, the interac- 
tion between human and system takes the following 
form. The user issues a command (e.g. query) and re- 
ceives a - possibly structured - result in response (e.g. 
ranking). The user then interacts with the results (e.g. 
clicks), thereby providing implicit feedback about the 
user's utility function. Here are three examples of such 
systems and their typical interaction patterns: 

Web-search: In response to a query, a search engine 
presents the ranking [A, B, C, D, ...] and observes 
that the user clicks on documents B and D. 



Movie Recommendation: An online service recom- 
mends movie A to a user. However, the user rents 
movie B after browsing the collection. 

Machine Translation: An online machine transla- 
tor is used to translate a wiki page from language 
A to B. The system observes some corrections the 
user makes to the translated text. 

In all the above examples, the user provides some 
feedback about the results of the system. However, 
the feedback is only an incremental improvement, not 
necessarily the optimal result. For example, from the 
clicks on the web-search results we can infer that the 
user would have preferred the ranking [B, D, A, C, ...] 
over the one we presented. However, this is unlikely to 
be the best possible ranking. Similarly in the recom- 
mendation example, movie B was preferred over movie 
A, but there may have been even better movies that 
the user did not find while browsing. In summary, the 
algorithm typically receives a slightly improved result 
from the user as feedback, but not necessarily the op- 
timal prediction nor any cardinal utilities. We conjec- 
ture that many other applications fall into this schema, 
ranging from news filtering to personal robotics. 

Our key contributions in this paper are threefold. 
First, we formalize Coactive Learning as a model of 
interaction between a learning system and its user, 
define a suitable notion of regret, and validate the 
key modeling assumption - namely whether observ- 
able user behavior can provide valid feedback in our 
model - in a web-search user study. Second, we derive 
learning algorithms for the Coactive Learning Model, 
including the cases of linear utility models and convex 
cost functions, and show $O(1/\sqrt{T})$ regret bounds in
either case with a matching lower bound. The learning
algorithms perform structured output prediction
(see (Bakir et al., 2007)) and thus can be applied in
a wide variety of problems. Several extensions of the 
model and the algorithm are discussed as well. Third, 
we provide extensive empirical evaluations of our algo- 
rithms on a movie recommendation and a web-search 
task, showing that the algorithms are highly efficient 
and effective in practical settings. 
2. Related Work

The Coactive Learning Model bridges the gap between 
two forms of feedback that have been well studied 
in online learning. On one side there is the multi-armed
bandit model (Auer et al., 2002b;a), where an
algorithm chooses an action and observes the utility
of (only) that action. On the other side, utilities
of all possible actions are revealed in the case of
learning with expert advice (Cesa-Bianchi & Lugosi,
2006). Online convex optimization (Zinkevich, 2003)
and online convex optimization in the bandit setting
(Flaxman et al., 2005) are continuous relaxations of
the expert and the bandit problems respectively. Our
model, where information about two arms is revealed
at each iteration, sits between the expert and the bandit
setting. Most closely related to Coactive Learning
is the dueling bandits setting (Yue et al., 2009;
Yue & Joachims, 2009). The key difference is that
both arms are chosen by the algorithm in the dueling
bandits setting, whereas one of the arms is chosen
by the user in the Coactive Learning setting.

While feedback in Coactive Learning takes the form 
of a preference, it is different from ordinal regression
and ranking. Ordinal regression (Crammer & Singer,
2001) assumes training examples $(x, y)$, where $y$ is a
rank. In the Coactive Learning model, absolute ranks
are never revealed. Closely related is learning with
pairs of examples (Herbrich et al., 2000; Freund et al.,
2003; Chu & Ghahramani, 2005), where absolute ranks
are not needed; however, existing approaches require
an iid assumption and typically perform batch learning.
There is also a large body of work on ranking
(see (Liu, 2009)). These approaches are different from
Coactive Learning; they require training data $(x, y)$
where $y$ is the optimal ranking for query $x$.

3. Coactive Learning Model 

We now introduce coactive learning as a model of in- 
teraction (in rounds) between a learning system (e.g. 
search engine) and a human (e.g. search user) where 
both the human and learning algorithm have the same 
goal (of obtaining good results). At each round $t$, the
learning algorithm observes a context $x_t \in \mathcal{X}$ (e.g. a
search query) and presents a structured object $y_t \in \mathcal{Y}$
(e.g. a ranked list of URLs). The utility of $y_t \in \mathcal{Y}$ to
the user for context $x_t \in \mathcal{X}$ is described by a utility
function $U(x_t, y_t)$, which is unknown to the learning
algorithm. As feedback the user returns an improved
object $\bar{y}_t \in \mathcal{Y}$ (e.g. reordered list of URLs), i.e.,

$$U(x_t, \bar{y}_t) > U(x_t, y_t), \qquad (1)$$

when such an object $\bar{y}_t$ exists. In fact, we will also
allow violations of (1) when we formally model user
feedback in Section 3.1. The process by which the
user generates the feedback $\bar{y}_t$ can be understood as
an approximate utility-maximizing search, but over a
user-defined subset $\bar{\mathcal{Y}}_t$ of all possible $\mathcal{Y}$. This models
an approximately and boundedly rational user that
may employ various tools (e.g., query reformulations,
browsing) to perform this search. Importantly, however,
the feedback $\bar{y}_t$ is typically not the optimal label

$$y_t^* := \operatorname{argmax}_{y \in \mathcal{Y}} U(x_t, y). \qquad (2)$$

In this way, Coactive Learning covers settings where
the user cannot manually optimize the argmax over the
full $\mathcal{Y}$ (e.g. produce the best possible ranking in web-search),
or has difficulty expressing a bandit-style cardinal
rating for $y_t$ in a consistent manner. This puts
our preference feedback $\bar{y}_t$ in stark contrast to supervised
learning approaches which require $(x_t, y_t^*)$. But
even more importantly, our model implies that reliable
preference feedback (1) can be derived from observable
user behavior (i.e., clicks), as we will demonstrate in
Section 3.2 for web-search. We conjecture that similar
feedback strategies also exist for other applications,
where users can be assumed to act approximately and
boundedly rational according to $U$.

Despite the weak preference feedback, the aim of a
coactive learning algorithm is to still present objects
with utility close to that of the optimal $y_t^*$. Whenever
the algorithm presents an object $y_t$ under context $x_t$,
we say that it suffers a regret $U(x_t, y_t^*) - U(x_t, y_t)$ at
time step $t$. Formally, we consider the average regret
suffered by an algorithm over $T$ steps as follows:

$$REG_T = \frac{1}{T} \sum_{t=1}^{T} \left( U(x_t, y_t^*) - U(x_t, y_t) \right). \qquad (3)$$

The goal of the learning algorithm is to minimize
$REG_T$, thereby providing the human with predictions
$y_t$ of high utility. Note, however, that a cardinal value
of $U$ is never observed by the learning algorithm, but
$U$ is only revealed ordinally through preferences (1).

3.1. Quantifying Preference Feedback Quality 

To provide any theoretical guarantees about the regret 
of a learning algorithm in the coactive setting, we need 
to quantify the quality of the user feedback. Note that 
this quantification is a tool for theoretical analysis, 
not a prerequisite or parameter to the algorithm. We 
quantify feedback quality by how much improvement 
$\bar{y}_t$ provides in utility space. In the simplest case, we
say that user feedback is strictly $\alpha$-informative when
the following inequality is satisfied:

$$U(x_t, \bar{y}_t) - U(x_t, y_t) \geq \alpha \left( U(x_t, y_t^*) - U(x_t, y_t) \right). \qquad (4)$$

In the above inequality, $\alpha \in (0, 1]$ is an unknown parameter.
Feedback is such that the utility of $\bar{y}_t$ is higher
than that of $y_t$ by a fraction $\alpha$ of the maximum possible
utility range $U(x_t, y_t^*) - U(x_t, y_t)$. Violations of
the above feedback model are allowed by introducing
slack variables $\xi_t \geq 0$:¹

$$U(x_t, \bar{y}_t) - U(x_t, y_t) \geq \alpha \left( U(x_t, y_t^*) - U(x_t, y_t) \right) - \xi_t. \qquad (5)$$

We refer to the above feedback model as $\alpha$-informative
feedback. Note also that it is possible to express feedback
of any quality using (5) with an appropriate value
of $\xi_t$. Our regret bounds will contain $\xi_t$, quantifying
to what extent the strict $\alpha$-informative modeling assumption
is violated.

Finally, we will also consider an even weaker feedback
model where a positive utility gain is only achieved in
expectation over user actions:

$$E_t\left[ U(x_t, \bar{y}_t) - U(x_t, y_t) \right] \geq \alpha \left( U(x_t, y_t^*) - U(x_t, y_t) \right) - \xi_t. \qquad (6)$$

We refer to the above feedback as expected $\alpha$-informative
feedback. In the above equation, the expectation
is over the user's choice of $\bar{y}_t$ given $y_t$ under
context $x_t$ (i.e., under a distribution $P_{x_t}[\bar{y}_t \mid y_t]$ which
is dependent on $x_t$).

Figure 1. Cumulative distribution of utility differences between
presented ranking $y$ and click-feedback ranking $\bar{y}$ in
terms of DCG@10 for three experimental conditions and
overall. (Horizontal axis: DCG(x, ybar) - DCG(x, y).)
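The role of the slack variable in Eq. (5) can be illustrated with a small numerical sketch. The function name `utility_gain_slack` and the example utility values are hypothetical; the code only evaluates the smallest non-negative slack that makes Eq. (5) hold for given utilities, which in practice are never observed by the learning algorithm.

```python
def utility_gain_slack(u_presented, u_feedback, u_optimal, alpha):
    """Smallest xi >= 0 such that Eq. (5) holds for one round.

    u_presented = U(x_t, y_t), u_feedback = U(x_t, ybar_t),
    u_optimal = U(x_t, y_t^*); all three are assumed known here
    purely for illustration.
    """
    required = alpha * (u_optimal - u_presented)  # right-hand side of Eq. (4)
    achieved = u_feedback - u_presented           # left-hand side of Eq. (4)
    return max(0.0, required - achieved)


# Feedback that closes half of the gap is strictly 0.5-informative (zero slack).
print(utility_gain_slack(1.0, 1.5, 2.0, alpha=0.5))
# Weaker feedback needs positive slack for the same alpha.
print(utility_gain_slack(1.0, 1.2, 2.0, alpha=0.5))
```

A slack of zero corresponds to strictly $\alpha$-informative feedback, Eq. (4); feedback that is worse than the presented object is still expressible, at the cost of a larger slack.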

3.2. User Study: Preferences from Clicks 

We now validate that reliable preferences as specified
in Equation (1) can indeed be inferred from implicit
user behavior. In particular, we focus on preference
feedback from clicks in web-search and draw upon
data from a user study (Joachims et al., 2007). In this
study, subjects (undergraduate students, n = 16) were
asked to answer 10 questions - 5 informational, 5 navigational -
using the Google search engine. All queries,
result lists, and clicks were recorded. For each subject,
queries were grouped into query chains by question.²
On average, each query chain contained 2.2 queries
and 1.8 clicks in the result lists.

We use the following strategy to infer a ranking $\bar{y}$ from
the user's clicks: prepend to the ranking $y$ from the
first query of the chain all results that the user clicked
throughout the whole query chain. To assess whether
$U(x, \bar{y})$ is indeed larger than $U(x, y)$ as assumed in
our learning model, we measure utility in terms of a
standard measure of retrieval quality from Information
Retrieval. We use

$$DCG@10(x, y) = \sum_{i=1}^{10} \frac{r(x, y[i])}{\log(i + 1)},$$

where $r(x, y[i])$ is the relevance score of the $i$-th document
in ranking $y$ (see e.g. (Manning et al., 2008)).
To get ground-truth relevance assessments $r(x, d)$, five
human assessors were asked to manually rank the set
of results encountered during each query chain. We
then linearly normalize the resulting ranks to a relative
relevance score $r(x, d) \in [0..5]$ for each document.
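The click-to-preference strategy and the DCG-based check can be sketched in a few lines. This is a minimal illustration, not the code used in the study: the function names, the natural logarithm in the discount, the removal of duplicates from the tail of the ranking, and the toy relevance scores are all assumptions.

```python
import math


def dcg_at_k(relevances, k=10):
    """DCG@k for a list of relevance scores given in ranked order.

    Position i (1-indexed) is discounted by log(i + 1); the natural
    logarithm is an assumption, since the base is not stated.
    """
    return sum(r / math.log(i + 2) for i, r in enumerate(relevances[:k]))


def feedback_from_clicks(ranking, clicked):
    """Prepend all clicked results to the presented ranking.

    Clicked results keep their click order; dropping them from the
    tail to avoid duplicates is an illustrative choice.
    """
    head, seen = [], set()
    for doc in clicked:
        if doc not in seen:
            seen.add(doc)
            head.append(doc)
    return head + [doc for doc in ranking if doc not in seen]


# Toy example mirroring the introduction: ranking [A, B, C, D], clicks on B and D.
presented = ["A", "B", "C", "D"]
feedback = feedback_from_clicks(presented, ["B", "D"])
relevance = {"A": 0.0, "B": 1.0, "C": 0.0, "D": 1.0}  # hypothetical assessor scores
gain = (dcg_at_k([relevance[d] for d in feedback])
        - dcg_at_k([relevance[d] for d in presented]))
print(feedback, gain > 0)
```

With these toy scores the feedback ranking has the higher DCG@10, which is the per-query quantity whose distribution Figure 1 reports.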

We can now evaluate whether the feedback ranking $\bar{y}$
is indeed better than the ranking $y$ that was originally
presented, i.e. $DCG@10(x, \bar{y}) > DCG@10(x, y)$.
Figure 1 plots the Cumulative Distribution Functions
(CDFs) of $DCG@10(x, \bar{y}) - DCG@10(x, y)$ for three
experimental conditions, as well as the average over
all conditions. All CDFs are shifted far to the right of
0, showing that preference feedback from our strategy
is highly accurate and informative. Focusing first on
the average over all conditions, the utility difference is
strictly positive on ~60% of all queries, and strictly
negative on only ~10%. This imbalance is significant
(binomial sign test, p < 0.0001). Among the remaining
~30% of cases where the DCG@10 difference is
zero, 88% are due to $\bar{y} = y$ (i.e. click only on top 1
or no click). Note that a learning algorithm can easily
detect those cases and may explicitly eliminate them
as feedback. Overall, this shows that implicit feedback
can indeed produce accurate preferences.

What remains to be shown is whether the reliability
of the feedback is affected by the quality of the current
prediction, i.e., $U(x_t, y_t)$. In the user study, some
users actually received results for which retrieval quality
was degraded on purpose. In particular, about one
third of the subjects received Google's top 10 results in
reverse order (condition "reversed") and another third
received rankings with the top two positions swapped
(condition "swapped"). As Figure 1 shows, we find
that users provide accurate preferences across this substantial
range of retrieval quality. Intuitively, a worse
retrieval system may make it harder to find good results,
but it also makes an easier baseline to improve
upon. This intuition is formally captured in our definition
of $\alpha$-informative feedback. The optimal value
of the $\alpha$ vs. $\xi$ trade-off, however, will likely depend
on many application-specific factors, like user motivation,
corpus properties, and query difficulty. In the
following, we therefore present algorithms that do not
require knowledge of $\alpha$, theoretical bounds that hold
for any value of $\alpha$, and experiments that explore a
large range of $\alpha$.

¹ Strictly speaking, the value of the slack variable depends
on the choice of $\alpha$ and the definition of utility. However,
for brevity, we do not explicitly show this dependence.

² This was done manually, but can be automated with
high accuracy (Jones & Klinkner, 2008).

Algorithm 1 Preference Perceptron.
Initialize $w_1 \leftarrow 0$
for $t = 1$ to $T$ do
    Observe $x_t$
    Present $y_t \leftarrow \operatorname{argmax}_{y \in \mathcal{Y}} w_t^\top \phi(x_t, y)$
    Obtain feedback $\bar{y}_t$
    Update: $w_{t+1} \leftarrow w_t + \phi(x_t, \bar{y}_t) - \phi(x_t, y_t)$
end for

4. Coactive Learning Algorithms 

In this section, we present algorithms for minimizing
regret in the coactive learning model. In the rest of this
paper, we use a linear model for the utility function,

$$U(x, y) = w_*^\top \phi(x, y), \qquad (7)$$

where $w_* \in \mathbb{R}^N$ is an unknown parameter vector and
$\phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^N$ is a joint feature map such that
$\|\phi(x, y)\|_{\ell_2} \leq R$ for any $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. Note that
both $x$ and $y$ can be structured objects.

We start by presenting and analyzing the most basic
algorithm for the coactive learning model, which
we call the Preference Perceptron (Algorithm 1). The
Preference Perceptron maintains a weight vector $w_t$
which is initialized to 0. At each time step $t$, the algorithm
observes the context $x_t$ and presents an object $y_t$
that maximizes $w_t^\top \phi(x_t, y)$. The algorithm then observes
user feedback $\bar{y}_t$ and the weight vector $w_t$ is
updated in the direction $\phi(x_t, \bar{y}_t) - \phi(x_t, y_t)$.
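The description above translates almost directly into code. The following is a minimal sketch under simplifying assumptions: the candidate set is a small finite list so the argmax can be enumerated, and `phi` and `feedback` are caller-supplied callables (hypothetical names) standing in for the joint feature map and the user.

```python
import numpy as np


def preference_perceptron(contexts, candidates, phi, feedback, dim):
    """Run the Preference Perceptron over a sequence of contexts.

    phi(x, y) returns a feature vector of length `dim`;
    feedback(x, y) returns the object the user prefers to y.
    Returns the final weight vector and the list of presented objects.
    """
    w = np.zeros(dim)
    presented = []
    for x in contexts:
        # Present the object that maximizes the current utility estimate.
        y = max(candidates, key=lambda c: float(w @ phi(x, c)))
        y_bar = feedback(x, y)
        # Move the weights toward the feedback and away from the prediction.
        w = w + phi(x, y_bar) - phi(x, y)
        presented.append(y)
    return w, presented


# Toy setting (illustrative only): three items with one-hot features and a
# user who always returns the item that is best under a hidden w_star.
ITEMS = np.eye(3)
W_STAR = np.array([0.1, 0.9, 0.5])


def toy_phi(x, y):
    return ITEMS[y]


def toy_feedback(x, y):
    return int(np.argmax(ITEMS @ W_STAR))


w, presented = preference_perceptron(range(4), [0, 1, 2], toy_phi, toy_feedback, dim=3)
print(presented, w)
```

In this toy run the first prediction is arbitrary (all scores tie at zero), and a single update is enough for the algorithm to present the user's preferred item from then on; when the prediction already equals the feedback, the update is zero.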

Theorem 1 The average regret of the preference perceptron
algorithm can be upper bounded, for any $\alpha \in (0, 1]$
and for any $w_*$, as follows:

$$REG_T \leq \frac{1}{\alpha T} \sum_{t=1}^{T} \xi_t + \frac{2R\|w_*\|}{\alpha \sqrt{T}}. \qquad (8)$$



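Proof First, consider $\|w_{T+1}\|^2$. We have

$$\begin{aligned}
w_{T+1}^\top w_{T+1} &= w_T^\top w_T + 2 w_T^\top \left( \phi(x_T, \bar{y}_T) - \phi(x_T, y_T) \right) \\
&\quad + \left( \phi(x_T, \bar{y}_T) - \phi(x_T, y_T) \right)^\top \left( \phi(x_T, \bar{y}_T) - \phi(x_T, y_T) \right) \\
&\leq w_T^\top w_T + 4R^2 \leq 4R^2 T.
\end{aligned}$$

In the equality, we simply used our update rule from
Algorithm 1. In the first inequality, we used the fact that
$w_T^\top (\phi(x_T, \bar{y}_T) - \phi(x_T, y_T)) \leq 0$ from the choice of
$y_T$ in Algorithm 1 and that $\|\phi(x, y)\| \leq R$. The last
inequality follows by applying the same argument to every
round, starting from $w_1 = 0$. Further,
from the update rule in Algorithm 1 we have

$$w_{T+1}^\top w_* = w_T^\top w_* + \left( \phi(x_T, \bar{y}_T) - \phi(x_T, y_T) \right)^\top w_* = \sum_{t=1}^{T} \left( U(x_t, \bar{y}_t) - U(x_t, y_t) \right). \qquad (9)$$

We now use the fact that $w_{T+1}^\top w_* \leq \|w_*\| \|w_{T+1}\|$
(Cauchy-Schwarz inequality), which implies

$$\sum_{t=1}^{T} \left( U(x_t, \bar{y}_t) - U(x_t, y_t) \right) \leq 2R\sqrt{T}\,\|w_*\|.$$

From the $\alpha$-informative modeling of the user feedback
in (5), we have

$$\alpha \sum_{t=1}^{T} \left( U(x_t, y_t^*) - U(x_t, y_t) \right) - \sum_{t=1}^{T} \xi_t \leq 2R\sqrt{T}\,\|w_*\|,$$

from which the claimed result follows.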



The first term in the regret bound denotes the quality
of feedback in terms of violation of the strict $\alpha$-informative
feedback. In particular, if the user feedback
is strictly $\alpha$-informative, then all slack variables
in (8) vanish and $REG_T = O(1/\sqrt{T})$.

Though user feedback is modeled via $\alpha$-informative
feedback, the algorithm itself does not require the
knowledge of $\alpha$; $\alpha$ plays a role only in the analysis.
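The bound in Theorem 1 can be checked numerically on synthetic data. The sketch below is not one of the paper's experiments: it assumes random unit-norm feature vectors (so $R = 1$), a fresh finite candidate set per round, and a simulated user who returns the lowest-utility candidate that still satisfies Eq. (4). The function name and all parameter values are illustrative.

```python
import numpy as np


def run_strict_feedback(T=400, K=20, N=5, alpha=0.5, seed=0):
    """Simulate the Preference Perceptron under strictly alpha-informative
    feedback and return (average regret, bound from Theorem 1 with zero slack)."""
    rng = np.random.default_rng(seed)
    w_star = rng.normal(size=N)          # hidden utility vector (assumption)
    w = np.zeros(N)
    total_regret = 0.0
    for _ in range(T):
        # One row per candidate object; rows are normalized so that R = 1.
        Phi = rng.normal(size=(K, N))
        Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)
        u = Phi @ w_star                 # true utilities, hidden from the learner
        y = int(np.argmax(Phi @ w))      # presented object
        # Weakest feedback that still satisfies Eq. (4); the tiny tolerance
        # guards against floating-point rounding.
        target = u[y] + alpha * (u.max() - u[y])
        ok = np.flatnonzero(u >= target - 1e-12)
        y_bar = int(ok[np.argmin(u[ok])])
        w += Phi[y_bar] - Phi[y]
        total_regret += u.max() - u[y]
    bound = 2.0 * np.linalg.norm(w_star) / (alpha * np.sqrt(T))
    return total_regret / T, bound


avg_regret, bound = run_strict_feedback()
print(avg_regret, bound)
```

Note that the learner never sees `u` or `alpha`; both are used only to generate feedback and to evaluate regret, which matches the remark that $\alpha$ plays a role only in the analysis.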

Although the preference perceptron appears similar to 
the standard perceptron for multi-class classification 
problems, there are key differences. First, the standard
perceptron algorithm requires the true label $y^*$
as feedback, whereas much weaker feedback $\bar{y}$ suffices
for our algorithm. Second, the standard analysis of
the perceptron bounds the number of mistakes made 
by the algorithm based on margin and the radius of the 
examples. In contrast, our analysis bounds a different 
regret that captures a graded notion of utility. 

An appealing aspect of our learning model is that sev- 
eral interesting extensions are possible. We discuss 
some of them in the rest of this section. 
4.1. Lower Bound

We now show that the upper bound in Theorem 1
cannot be improved in general.

Lemma 2 For any coactive learning algorithm $A$ with
linear utility, there exist $x_t$, objects $\mathcal{Y}$ and $w_*$ such that
$REG_T$ of $A$ in $T$ steps is $\Omega(1/\sqrt{T})$.

Proof Consider a problem where $\mathcal{Y} = \{-1, +1\}$, $\mathcal{X} =
\{x \in \mathbb{R}^T : \|x\| = 1\}$. Define the joint feature map
as $\phi(x, y) = yx$. Consider $T$ contexts $e_1, \ldots, e_T$ such
that $e_j$ has only the $j$-th component equal to one and
all the others equal to zero. Let $y_1, \ldots, y_T$ be the sequence
of outputs of $A$ on contexts $e_1, \ldots, e_T$. Construct
$w_* = [-y_1/\sqrt{T} \;\; -y_2/\sqrt{T} \;\; \cdots \;\; -y_T/\sqrt{T}]^\top$; we
have for this construction $\|w_*\| = 1$. Let the user
feedback on the $t$-th step be $-y_t$. With these choices,
the user feedback is always $\alpha$-informative with $\alpha = 1$
since $y_t^* = -y_t$. Yet, the regret of the algorithm is
$\frac{1}{T} \sum_{t=1}^{T} \left( w_*^\top \phi(e_t, y_t^*) - w_*^\top \phi(e_t, y_t) \right) = \Omega(1/\sqrt{T})$. ■
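The construction in the proof can be simulated directly. In the sketch below, `predict` is a hypothetical stand-in for an arbitrary algorithm $A$: it receives the round index and the history of (round, prediction, feedback) triples and returns $-1$ or $+1$. Because context $e_t$ touches only the $t$-th coordinate of $w_*$, the adversary can fix that coordinate after seeing $y_t$.

```python
import math


def adversarial_regret(predict, T):
    """Average regret of `predict` on the lower-bound construction.

    Round t uses context e_t, so only the t-th coordinate of w_star
    matters; it is set to -y_t / sqrt(T) after observing y_t.
    """
    history = []
    total = 0.0
    for t in range(T):
        y = predict(t, history)              # algorithm's output in {-1, +1}
        w_star_t = -y / math.sqrt(T)         # t-th coordinate of w_star
        y_opt = -y                           # optimal label, also the feedback
        # U(e_t, y) = w_star^T (y * e_t) = y * w_star_t
        total += w_star_t * y_opt - w_star_t * y
        history.append((t, y, y_opt))
    return total / T


def always_plus_one(t, history):
    return +1


def copy_last_feedback(t, history):
    return history[-1][2] if history else +1


print(adversarial_regret(always_plus_one, 100))
print(adversarial_regret(copy_last_feedback, 100))
```

Every round contributes exactly $2/\sqrt{T}$ regardless of how `predict` uses the history, so the average regret is $2/\sqrt{T}$, matching the $\Omega(1/\sqrt{T})$ rate of the lemma.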

4.2. Batch Update 

In some applications, due to high volumes of feedback, 
it might not be possible to do an update after every 
round. For such scenarios, it is natural to consider a 
variant of Algorithm 1 that makes an update every $k$
iterations; the algorithm simply uses $w_t$ obtained from
the previous update until the next update. It is easy
to show the following regret bound for batch updates:



1 = 1 



4.3. Expected a-Informative Feedback 

So far, we have characterized user behavior in terms
of deterministic feedback actions. However, if a bound
on the expected regret suffices, the weaker model of
Expected $\alpha$-Informative Feedback from Equation (6)
is applicable.

Corollary 3 Under the expected $\alpha$-informative feedback
model, the expected regret (over the user behavior distribution)
of the preference perceptron algorithm can be
upper bounded as follows:

$$E[REG_T] \leq \frac{1}{\alpha T} \sum_{t=1}^{T} \xi_t + \frac{2R\|w_*\|}{\alpha \sqrt{T}}. \qquad (10)$$

The above corollary can be proved by following the
argument of Theorem 1, but taking expectations
over user feedback:

$$\begin{aligned}
E[w_{T+1}^\top w_{T+1}] &= E[w_T^\top w_T] + E\left[ 2 w_T^\top \left( \phi(x_T, \bar{y}_T) - \phi(x_T, y_T) \right) \right] \\
&\quad + E_T\left[ \left( \phi(x_T, \bar{y}_T) - \phi(x_T, y_T) \right)^\top \left( \phi(x_T, \bar{y}_T) - \phi(x_T, y_T) \right) \right] \\
&\leq E[w_T^\top w_T] + 4R^2.
\end{aligned}$$

In the above, $E$ denotes expectation over all user
feedback $\bar{y}_t$ given $y_t$ under the context $x_t$. It follows
that $E[w_{T+1}^\top w_{T+1}] \leq 4TR^2$.

Applying Jensen's inequality on the concave function
$\sqrt{\cdot}$, we get: $E[w_T^\top w_*] \leq \|w_*\| E[\|w_T\|] \leq
\|w_*\| \sqrt{E[w_T^\top w_T]}$. The corollary follows from the definition
of expected $\alpha$-informative feedback.

Algorithm 2 Convex Preference Perceptron.
Initialize $w_1 \leftarrow 0$
for $t = 1$ to $T$ do
    Set $\eta_t \leftarrow 1/\sqrt{t}$
    Observe $x_t$
    Present $y_t \leftarrow \operatorname{argmax}_{y \in \mathcal{Y}} w_t^\top \phi(x_t, y)$
    Obtain feedback $\bar{y}_t$
    Update: $\bar{w}_{t+1} \leftarrow w_t + \eta_t G \left( \phi(x_t, \bar{y}_t) - \phi(x_t, y_t) \right)$
    Project: $w_{t+1} \leftarrow \operatorname{argmin}_{u \in B} \|u - \bar{w}_{t+1}\|^2$
end for

4.4. Convex Loss Minimization 

We now generalize our results to minimize convex
losses defined on the linear utility differences. We assume
that at every time step $t$, there is an (unknown)
convex loss function $c_t : \mathbb{R} \to \mathbb{R}$ which determines the
loss $c_t(U(x_t, y_t) - U(x_t, y_t^*))$ at time $t$. The functions
$c_t$ are assumed to be non-increasing. Further, sub-derivatives
of the $c_t$'s are assumed to be bounded (i.e.,
$c_t'(\theta) \in [-G, 0]$ for all $t$ and for all $\theta \in \mathbb{R}$). The vector
$w_*$ which determines the utility of $y_t$ under context
$x_t$ is assumed to come from a closed and bounded convex set
$B$ whose diameter is denoted as $|B|$.

Algorithm 2 minimizes the average convex loss. There
are two differences between this algorithm and Algorithm 1.
Firstly, there is a rate $\eta_t$ associated with the
update at time $t$. Moreover, after every update, the resulting
vector $\bar{w}_{t+1}$ is projected back to the set $B$. We
have the following result for Algorithm 2, a proof of
which is provided in an extended version of this paper
(Shivaswamy & Joachims, 2012).
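The two differences from the basic algorithm, a step size and a projection, can be sketched as a single update step. The choices below are assumptions made for illustration: the set $B$ is taken to be an L2 ball of a given radius centered at the origin (so the projection has a closed form), the step size is taken to be $1/\sqrt{t}$, and the function names are hypothetical.

```python
import numpy as np


def project_l2_ball(w, radius):
    """Euclidean projection onto {u : ||u|| <= radius} (assumed form of B)."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)


def convex_perceptron_step(w, t, phi_feedback, phi_presented, G, radius):
    """One update of the convex variant at round t (t starts at 1).

    phi_feedback and phi_presented are the feature vectors of the user's
    feedback and of the presented object; G bounds the sub-derivatives.
    """
    eta = 1.0 / np.sqrt(t)                        # assumed step size
    w_bar = w + eta * G * (phi_feedback - phi_presented)
    return project_l2_ball(w_bar, radius)


w = np.zeros(2)
w = convex_perceptron_step(w, 1, np.array([3.0, 4.0]), np.zeros(2), G=1.0, radius=1.0)
print(w)
```

For other choices of $B$ (for example a box or a simplex) only `project_l2_ball` would need to be replaced by the corresponding Euclidean projection.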



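Theorem 4 For the convex preference perceptron, we
have, for any $\alpha \in (0, 1]$ and any $w_* \in B$,

$$\frac{1}{T} \sum_{t=1}^{T} c_t\left( U(x_t, y_t) - U(x_t, y_t^*) \right) \leq \frac{1}{T} \sum_{t=1}^{T} c_t(0) + \frac{2G}{\alpha T} \sum_{t=1}^{T} \xi_t + \frac{1}{\alpha} \left( \frac{|B| G}{2\sqrt{T}} + \frac{|B| G}{T} + \frac{4R^2 G}{\sqrt{T}} \right). \qquad (11)$$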
In the bound (11), $c_t(0)$ is the minimum possible
convex loss since $U(x_t, y_t) - U(x_t, y_t^*)$ can never be
greater than zero by definition of $y_t^*$. Thus the theorem
upper bounds the average convex loss via the
minimum achievable loss and the quality of feedback.
Like the previous result (Theorem 1), under strict $\alpha$-informative
feedback, the average loss approaches the
best achievable loss at $O(1/\sqrt{T})$, albeit with larger constant
factors.

5. Experiments 

We empirically evaluated the Preference Perceptron 
algorithm on two datasets. The two experiments dif- 
fered in the nature of prediction and feedback. While 
the algorithm operated on structured objects (rank- 
ings) in one experiment, atomic items (movies) were 
presented and received as feedback in the other. 

5.1. Structured Feedback: Learning to Rank 

We evaluated our Preference Perceptron algorithm
on the Yahoo! learning to rank dataset
(Chapelle & Chang, 2011). This dataset consists of
query-url feature vectors (denoted as $x_i^q$ for query $q$
and URL $i$), each with a relevance rating $r_i^q$ that ranges
from zero (irrelevant) to four (perfectly relevant). To
pose ranking as a structured prediction problem, we
defined our joint feature map as follows:

$$w^\top \phi(q, y) = \sum_{i=1}^{5} \frac{w^\top x^q_{y_i}}{\log(i + 1)}. \qquad (12)$$

In the above equation, $y$ denotes a ranking such that
$y_i$ is the index of the URL which is placed at position
$i$ in the ranking. Thus, the above measure considers
the top five URLs for a query $q$ and computes a score
based on a graded relevance. Note that the above utility
function defined via the feature map is analogous to
DCG@5 (see e.g. (Manning et al., 2008)) after replacing
the relevance label with a linear prediction based
on the features.

For query $q_t$ at time step $t$, the Preference Perceptron
algorithm presents the ranking $y^{q_t}$ that maximizes
$w_t^\top \phi(q_t, y)$. Note that this merely amounts to sorting
documents by the scores $w_t^\top x_i^{q_t}$, which can be
done very efficiently. The utility regret in Eqn. (3),
based on the definition of utility in (12), is given by
$\frac{1}{T} \sum_{t=1}^{T} w_*^\top \left( \phi(q_t, y^{q_t *}) - \phi(q_t, y^{q_t}) \right)$. Here $y^{q_t *}$ denotes
the optimal ranking with respect to $w_*$, which is the
best least squares fit to the relevance labels from the
features using the entire dataset. Query ordering was
randomly permuted twenty times and we report average
and standard error of the results.
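The joint feature map of Eq. (12) and the sort-based argmax can be sketched as follows. The function names, the use of the natural logarithm, and the toy feature matrix are assumptions for illustration; `X` holds one feature vector per candidate URL of a single query.

```python
import itertools

import numpy as np


def joint_feature_map(X, ranking, k=5):
    """phi(q, y): discounted sum of the feature vectors of the top-k URLs.

    X has shape (n_urls, n_features); ranking lists URL indices by position.
    """
    top = np.asarray(ranking[:k])
    # Position i (1-indexed) is discounted by 1 / log(i + 1).
    discounts = 1.0 / np.log(np.arange(1, len(top) + 1) + 1.0)
    return discounts @ X[top]


def best_ranking(X, w):
    """Ranking that maximizes w^T phi(q, y): sort URLs by score w^T x, descending."""
    return np.argsort(-(X @ w), kind="stable")


X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy features for three URLs
w = np.array([1.0, 2.0])
y = best_ranking(X, w)
print(list(y), float(w @ joint_feature_map(X, y)))
```

Sorting suffices because the discounts are positive and decreasing in the position, so placing higher-scoring URLs earlier can never lower the total; the brute-force comparison over all permutations in the accompanying check confirms this on the toy input.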



5.1.1. Strong Vs Weak Feedback 

The goal of the first experiment was to see how the
regret of the algorithm changes with feedback quality.
To get feedback at different quality levels $\alpha$, we used
the following mechanism. Given the predicted ranking
$y_t$, the user would go down the list until she found
five URLs such that, when placed at the top of the
list, the resulting $\bar{y}_t$ satisfied the strictly $\alpha$-informative
feedback condition w.r.t. the optimal $w_*$.

Figure 2. Regret based on strictly $\alpha$-informative feedback.

Figure 2 shows the results for this experiment for
two different $\alpha$ values. As expected, the regret with
$\alpha = 1.0$ is lower compared to the regret with
$\alpha = 0.1$. Note, however, that the difference between
the two curves is much smaller than a factor of ten.
This is because strictly $\alpha$-informative feedback is also
strictly $\beta$-informative feedback for any $\beta \leq \alpha$. So,
there could be several instances where user feedback
was much stronger than what was required. As expected
from the theoretical bounds, since the user
feedback is based on a linear model with no noise, utility
regret approaches zero.

5.1.2. Noisy Feedback 

In the previous experiment, user feedback was based
on actual utility values computed from the optimal
$w_*$. We next make use of the actual relevance labels
provided in the dataset for user feedback. Now, given
a ranking for a query, the user would go down the list
inspecting the top 10 URLs (or all the URLs if the
list is shorter) as before. Five URLs with the highest
relevance labels ($r_i^q$) are placed at the top five locations
in the user feedback. Note that this produces noisy
feedback since no linear model can perfectly fit the
relevance labels on this dataset.
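The simulated user described above can be sketched as a short function. Details the text leaves open are filled in with labeled assumptions: ties among equally relevant URLs keep their presented order, and the remaining URLs follow in their presented order. The function name and the toy labels are hypothetical.

```python
def relevance_feedback(ranking, relevance, inspect=10, promote=5):
    """Feedback ranking built from relevance labels.

    The user inspects the first `inspect` URLs of `ranking` and moves the
    `promote` most relevant of them to the top.
    """
    inspected = list(ranking[:inspect])
    # sorted() is stable, so equally relevant URLs keep their presented order.
    promoted = sorted(inspected, key=lambda url: -relevance[url])[:promote]
    promoted_set = set(promoted)
    return promoted + [url for url in ranking if url not in promoted_set]


presented = [0, 1, 2, 3, 4, 5, 6]
labels = {0: 0, 1: 4, 2: 1, 3: 0, 4: 3, 5: 2, 6: 4}  # toy relevance ratings
print(relevance_feedback(presented, labels))
```

Because the labels need not agree with any linear utility $w_*^\top \phi$, feedback produced this way can violate Eq. (4), which is what makes this setting noisy.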

As a baseline, we repeatedly trained a conventional
Ranking SVM.³ At each iteration, the previous SVM
model was used to present a ranking to the user. The
user returned a ranking based on the relevance labels
as above. The pairs of examples $(q_t, \bar{y}^{q_t}_{svm})$ and
$(q_t, y^{q_t}_{svm})$ were used as training pairs for the ranking
SVMs. Note that training a ranking SVM after each
iteration would be prohibitive, since it involves solving
a quadratic program and cross-validating the regularization
parameter $C$. Thus, we retrained the SVM
whenever 10% more examples were added to the training
set. The first training was after the first iteration
with just one pair of examples (starting with a random
$y^{q_1}$), and the $C$ value was fixed at 100 until there were
50 pairs of examples, when reliable cross-validation became
possible. After there were more than 50 pairs in
the training set, the $C$ value was obtained via five-fold
cross-validation. Once the $C$ value was determined,
the SVM was trained on all the training examples
available at that time. The same SVM model was then
used to present rankings until the next retraining.

³ http://svmlight.joachims.org
Figure 3. Regret vs time based on noisy feedback.

Results of this experiment are presented in Figure 3.
Since the feedback is now based on noisy relevance labels,
the utility regret converges to a non-zero value
as predicted by our theoretical results. Over most
of the range, the Preference Perceptron performs significantly⁴
better than the SVM. Moreover, the perceptron
experiment took around 30 minutes to run,
whereas the SVM experiment took about 20 hours on
the same machine. We conjecture that the regret values
for both the algorithms can be improved with better
features or kernels, but these extensions are orthogonal
to the main focus of this paper.

5.2. Item Feedback: Movie Recommendation 

In contrast to the structured prediction problem in
the previous section, we now evaluate the Preference
Perceptron on a task with atomic predictions, namely
movie recommendation. In each iteration a movie is
presented to the user, and the feedback consists of a
movie as well. We use the MovieLens dataset, which
consists of a million ratings over 3090 movies rated by
6040 users. The movie ratings ranged from one to five.

We randomly divided users into two equally sized sets.
The first set was used to obtain a feature vector $m_j$ for
each movie $j$ using the "SVD embedding" method for
collaborative filtering (see (Bell & Koren, 2007), Eqn.
(15)). The dimensionality of the feature vectors and
the regularization parameters were chosen to optimize
cross-validation accuracy on the first dataset in terms
of squared error. For the second set of users, we then
considered the problem of recommending movies based
on the movie features $m_j$. This experiment setup simulates
the task of recommending movies to a new user
based on movie features from old users.

For each user $i$ in the second set, we found the best
least squares approximation $w_{i*}^\top m_j$ to the user's utility
function on the available ratings. This enables us
to impute utility values for movies that were not explicitly
rated by this user. Furthermore, it allows us to
measure regret for each user as $\frac{1}{T} \sum_{t=1}^{T} w_{i*}^\top (m_{t*} - m_t)$,
which is the average difference in utility between the
recommended movie $m_t$ and the best available movie
$m_{t*}$. We denote the best available movie at time $t$
by $m_{t*}$, since in this experiment, once a user gave a
particular movie as feedback, both the recommended
movie and the feedback movie were removed from the
set of candidates for subsequent recommendations.

5.2.1. Strong Vs Weak Feedback 

Analogous to the web-search experiments, we first explore
how the performance of the Preference Perceptron
changes with feedback quality $\alpha$. In particular, we
recommended a movie with maximum utility according
to the current $w_t$ of the algorithm, and the user
returns as feedback a movie with the smallest utility
that still satisfied strictly $\alpha$-informative feedback according
to $w_{i*}$. For every user in the second set, the
algorithm iteratively recommended 1500 movies in this
way. Regret was calculated after each iteration and
separately for each user, and all regrets were averaged
over all the users in the second set.

⁴ The error bars are extremely tiny at higher iterations.

Figure 4. Regret for strictly $\alpha$-informative feedback.

Figure 4 shows the results for this experiment. Since
the feedback in this case is strictly $\alpha$-informative, the
average regret in all the cases decreases towards zero
as expected. Note that even for a moderate value of
$\alpha$, regret is already substantially reduced after tens of
iterations. With higher $\alpha$ values, the regret converges
to zero at a much faster rate than with lower $\alpha$ values.
5.2.2. Noisy Feedback

We now consider noisy feedback, where the user feedback
does not necessarily match the linear utility
model used by the algorithm. In particular, feedback is
now given based on the actual ratings when available,
or the score $w_{i*}^\top m_j$ rounded to the nearest allowed rating
value. In every iteration, the user returned a movie
with one rating higher than the one presented to her.
If the algorithm already presented a movie with the
highest rating, it was assumed that the user gave the
same movie as feedback.
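The rating-based feedback rule can be sketched as follows. Two details are not specified in the text and are filled in as labeled assumptions: among several candidates rated exactly one step higher the first one is returned, and if no such candidate exists the presented movie is returned unchanged. The function name and toy ratings are hypothetical.

```python
def one_rating_higher(presented, ratings, candidates, max_rating=5):
    """Return a movie rated exactly one step above the presented movie.

    ratings maps each movie to its actual or imputed-and-rounded rating.
    """
    current = ratings[presented]
    if current >= max_rating:
        # Already top-rated: the user gives the same movie as feedback.
        return presented
    for movie in candidates:
        if ratings[movie] == current + 1:
            return movie  # assumption: first match wins
    return presented      # assumption: fallback when no such movie exists


toy_ratings = {"a": 3, "b": 5, "c": 4, "d": 4}
print(one_rating_higher("a", toy_ratings, ["b", "c", "d"]))
print(one_rating_higher("b", toy_ratings, ["a", "c", "d"]))
```

Feedback of this kind improves the rating by only one step, so it is typically far from the best available movie and need not be consistent with the linear utility model.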
Figure 5. Regret based on noisy feedback.

As a baseline, we again ran a ranking SVM. Like in
the web-search experiment, it was retrained whenever
10% more training data was added. The results for
this experiment are shown in Figure 5. The regret of
the Preference Perceptron is again significantly lower
than that of the SVM, and at a small fraction of the
computational cost.

6. Conclusions 

We proposed a new model of online learning where 
preference feedback is observed but cardinal feedback 
is never observed. We proposed a suitable notion of 
regret and showed that it can be minimized under 
our feedback model. Further, we provided several extensions
of the model and algorithms. Finally,
experiments demonstrated the effectiveness of our algorithms for web-search
ranking and a movie recommendation task. A
future direction is to consider $\lambda$-strongly convex functions,
and we conjecture it is possible to derive algorithms
with $O(\log(T)/T)$ regret in this case.